Abstract 

A standard form of analysis for linguis- 
tic typology is the universal implication. 
These implications state facts about the 
range of extant languages, such as "if ob- 
jects come after verbs, then adjectives come 
after nouns." Such implications are typically
discovered by painstaking hand analysis over a
small sample of languages. We propose a
computational model for assisting in this process.
Our model is able to discover both well-known
implications and some novel implications that deserve
further study. Moreover, through a careful
application of hierarchical analysis, we are 
able to cope with the well-known sampling 
problem: languages are not independent. 

1 Introduction 

Linguistic typology aims to distinguish between 
logically possible languages and actually observed 
languages. A fundamental building block for
such an understanding is the universal implication
(Greenberg, 1963). These are short statements that
restrict the space of languages in a concrete way
(for instance "object-verb ordering implies
adjective-noun ordering"); Croft (2003),
Hawkins (1983) and Song (2001) provide excellent
introductions to linguistic typology. We present
a statistical model for automatically discovering
such implications from a large typological database
(Haspelmath et al., 2005).

Analyses of universal implications are typically
performed by linguists, inspecting an array of 30-100
languages and a few pairs of features. Looking
at all pairs of features (typically several hundred) is
virtually impossible by hand. Moreover, it is
insufficient to simply look at counts. For instance, results
presented in the form "verb precedes object implies
prepositions in 16/19 languages" are inconclusive.
While compelling, this is not enough evidence to
decide if this is a statistically well-founded
implication. For one, maybe 99% of languages have
prepositions: then the fact that we've achieved a rate of
84% actually seems quite poor. Moreover, if the 16
languages are highly related historically or areally
(geographically), and the other 3 are not, then we
may have only learned something about geography.

In this work, we propose a statistical model that
deals cleanly with these difficulties. By building a
computational model, it is possible to apply it to
a very large typological database and search over
many thousands of pairs of features. Our model
hinges on two novel components: a statistical noise
model and a hierarchical inference over language
families. To our knowledge, there is no prior work
directly in this area. The closest work is
represented by the books Possible and Probable
Languages (Newmeyer, 2005) and Language
Classification by Numbers (McMahon and McMahon, 2005),
but the focus of these books is on automatically
discovering phylogenetic trees for languages based on
Indo-European cognate sets (Dyen et al., 1992).

2 Data 

The database on which we perform our analysis
is the World Atlas of Language Structures
(Haspelmath et al., 2005). This database contains
information about 2150 languages (sampled from
across the world; Figure 1 depicts the locations of
languages). There are 139 features in this database,
broken down into categories such as "Nominal
Categories," "Simple Clauses," "Phonology," "Word
Order," etc. The database is sparse: for many
language/feature pairs, the feature value is unknown. In
fact, only about 16% of all possible language/feature
pairs are known. A sample of five languages and six
features from the database is shown in Table 1.

Importantly, the density of samples is not random.
For certain languages (e.g., English, Chinese,
Russian), nearly all features are known, whereas other
languages (e.g., Asturian, Omagua, Frisian) have
fewer than five feature values known. Furthermore,
some features are known for many languages. This
is due to the fact that certain features take less effort
to identify than others. Identifying, for instance, if
a language has a particular set of phonological
features (such as glottalized consonants) requires only
listening to speakers. Determining other features, such
as the order of relative clauses and nouns,
requires understanding much more of the language.

Figure 1: Map of the 2150 languages in the database. 

Table 1: Example database entries for a selection of diverse languages and features. 

3 Models 

In this section, we propose two models for
automatically uncovering universal implications from noisy,
sparse data. First, note that even well attested
implications are not always exceptionless. A common
example is that verbs preceding objects ("VO") implies
adjectives following nouns ("NA"). This implication
(VO ⊃ NA) has one glaring exception: English.
This is one particular form of noise. Another source
of noise stems from transcription. WALS contains
data about languages documented by field linguists
as early as the 1900s. Much of this older data was
collected before there was significant agreement in
documentation style. Different field linguists
often had different dimensions along which they
segmented language features into classes. This leads to
noise in the properties of individual languages.

Another difficulty stems from the sampling
problem. This is a well-documented issue (see, e.g.,
Croft (2003)) stemming from the fact that any set of
languages is not sampled uniformly from the space
of all probable languages. Politically interesting
languages (e.g., Indo-European) and typologically
unusual languages (e.g., Dyirbal) are better
documented than others. Moreover, languages are not
independent: German and Dutch are more similar than
German and Hindi due to history and geography.

The first model, Flat, treats each language as
independent. It is thus susceptible to sampling
problems. For instance, the WALS database contains a
half dozen versions of German. The Flat model
considers these versions of German just as
statistically independent as, say, German and Hindi. To
cope with this problem, we then augment the Flat
model into a hierarchical model (Hier) that takes
advantage of known hierarchies in linguistic
phylogenetics. The Hier model explicitly models the fact that
individual languages are not independent and exhibit
strong familial dependencies. In both models, we
initially restrict our attention to pairs of features. We
will describe our models as if all features are binary.
We expand any multi-valued feature with K values
into K binary features in a "one versus rest" manner.

3.1 The Flat Model 

In the Flat model, we consider a 2 × N matrix of
feature values. The N corresponds to the number of
languages, while the 2 corresponds to the two
features currently under consideration (e.g., object/verb
order and noun/adjective order). The order of the
two features is important: $f_1$ implies $f_2$ is logically
different from $f_2$ implies $f_1$. Some of the entries in
the matrix will be unknown. We may safely remove
all languages from consideration for which both are
unknown, but we do not remove languages for which
only one is unknown. We do so because our model
needs to capture the fact that if $f_2$ is always true,
then $f_1 \supset f_2$ is uninteresting.
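The data preparation described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' implementation (which was written in Matlab); the function names `one_vs_rest` and `pair_matrix`, the use of `None` for unknown entries, and the toy values are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Value = Optional[str]          # None marks an unknown database entry
Binary = Optional[bool]


def one_vs_rest(values: List[Value], target: str) -> List[Binary]:
    """Binary feature "value == target"; unknown entries stay unknown."""
    return [None if v is None else (v == target) for v in values]


def pair_matrix(f1: List[Binary], f2: List[Binary]) -> List[Tuple[Binary, Binary]]:
    """Drop a language only when BOTH of its entries are unknown."""
    return [(a, b) for a, b in zip(f1, f2) if not (a is None and b is None)]


# Hypothetical toy data for four languages.
verb_object = ["VO", "OV", None, None]
noun_adjective = ["NA", None, "AN", None]

f1 = one_vs_rest(verb_object, "VO")
f2 = one_vs_rest(noun_adjective, "NA")
matrix = pair_matrix(f1, f2)
```

In this toy case only the fourth language is removed; the second and third are kept even though one of their two entries is unknown.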

The statistical model is set up as follows. There is 
a single variable (we will denote this variable "m") 
corresponding to whether the implication holds. 
Thus, m = 1 means that $f_1$ implies $f_2$ and m = 0
means that it does not. Independent of m, we specify
two feature priors, $\pi_1$ and $\pi_2$ for $f_1$ and $f_2$
respectively. $\pi_1$ specifies the prior probability that $f_1$ will
be true, and $\pi_2$ specifies the prior probability that $f_2$
will be true. One can then put the model together
naively as follows. If m = 0 (i.e., the implication
does not hold), then the entire data matrix is
generated by choosing values for $f_1$ (resp., $f_2$)
independently according to the prior probability $\pi_1$ (resp.,
$\pi_2$). On the other hand, if m = 1 (i.e., the
implication does hold), then the first column of the data
matrix is generated by choosing values for $f_1$
independently by $\pi_1$, but the second column is generated
differently. In particular, if for a particular language,
we have that $f_1$ is true, then the fact that the
implication holds means that $f_2$ must be true. On the other
hand, if $f_1$ is false for a particular language, then we
may generate $f_2$ according to the prior probability
$\pi_2$. Thus, having m = 1 means that the model is
significantly more constrained. In equations:

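$$p(f_1 \mid \pi_1) = \pi_1^{f_1} (1 - \pi_1)^{1 - f_1}$$

$$p(f_2 \mid f_1, \pi_2, m) = \begin{cases} f_2 & \text{if } m = f_1 = 1 \\ \pi_2^{f_2} (1 - \pi_2)^{1 - f_2} & \text{otherwise} \end{cases}$$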
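To make the naive model concrete, the following Python sketch evaluates its log-likelihood for a fully observed 0/1 data matrix. It is an illustration only: the function names are hypothetical, unknown entries are assumed to have been handled elsewhere, and the values of the two priors in the usage lines are arbitrary.

```python
import math
from typing import List, Tuple


def bernoulli(x: int, p: float) -> float:
    """Probability of a binary outcome x under success probability p."""
    return p if x == 1 else 1.0 - p


def naive_log_likelihood(data: List[Tuple[int, int]],
                         pi1: float, pi2: float, m: int) -> float:
    """Log-probability of (f1, f2) pairs under the noise-free model."""
    total = 0.0
    for f1, f2 in data:
        p1 = bernoulli(f1, pi1)
        if m == 1 and f1 == 1:
            p2 = float(f2)              # the implication forces f2 = 1
        else:
            p2 = bernoulli(f2, pi2)
        prob = p1 * p2
        if prob == 0.0:
            return float("-inf")        # a single counterexample rules out m = 1
        total += math.log(prob)
    return total


consistent = [(1, 1), (0, 0), (1, 1)]
with_m = naive_log_likelihood(consistent, 0.5, 0.5, m=1)
without_m = naive_log_likelihood(consistent, 0.5, 0.5, m=0)
violated = naive_log_likelihood([(1, 0)], 0.5, 0.5, m=1)
```

On data that never contradict the implication, m = 1 receives the higher likelihood because it spends no probability mass on the impossible cell; a single violating language drives the likelihood to zero.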

The problem with this naive model is that it does 
not take into account the fact that there is "noise" 
in the data. (By noise, we refer either to mis- 
annotations, or to "strange" languages like English.) 
To account for this, we introduce a simple noise 
model. There are several options for parameteriz- 
ing the noise, depending on what independence as- 
sumptions we wish to make. One could simply spec- 
ify a noise rate for the entire data set. One could 
alternatively specify a language-specific noise rate. 
Or one could specify a feature-specific noise rate.
We opt for a blend between the first and second
option. We assume an underlying noise rate for the
entire data set, but that, conditioned on this underlying
rate, there is a language-specific noise level. We
believe this to be an appropriate noise model because it
models the fact that the majority of information for
a single language is from a single source. Thus, if
there is an error in the database, it is more likely that
other errors will be for the same languages.

Figure 2: Graphical model for the Flat model. 

In order to model this statistically, we assume that
there are latent variables $e_{1,n}$ and $e_{2,n}$ for each
language n. If $e_{1,n} = 1$, then the first feature for
language n is wrong. Similarly, if $e_{2,n} = 1$, then the
second feature for language n is wrong. Given this
model, the probabilities are exactly as in the naive
model, with the exception that instead of using $f_1$
(resp., $f_2$), we use the exclusive-or[1] $f_1 \otimes e_1$ (resp.,
$f_2 \otimes e_2$) so that the feature values are flipped
whenever the noise model suggests an error.
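The effect of the error indicators can be illustrated with a short Python sketch. The helper names and the two-language example are hypothetical; the only operation taken from the text is the exclusive-or of each observed feature with its error indicator.

```python
from typing import List, Tuple

Pair = Tuple[int, int]


def apply_noise(features: List[Pair], errors: List[Pair]) -> List[Pair]:
    """Flip each observed 0/1 feature whose error indicator is 1 (exclusive-or)."""
    return [(f1 ^ e1, f2 ^ e2) for (f1, f2), (e1, e2) in zip(features, errors)]


def violates_implication(pairs: List[Pair]) -> List[bool]:
    """True where f1 holds but f2 does not."""
    return [f1 == 1 and f2 == 0 for f1, f2 in pairs]


# Two hypothetical languages: the second one contradicts f1 implies f2.
observed = [(1, 1), (1, 0)]
# Error indicators that mark the second language's second feature as wrong.
errors = [(0, 0), (0, 1)]

cleaned = apply_noise(observed, errors)
```

With these indicators the exceptional language no longer contradicts the implication, which is how a language like English can be absorbed as noise rather than refuting m = 1 outright.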

The graphical model for the Flat model is shown
in Figure 2. Circular nodes denote random variables
and arrows denote conditional dependencies. The
rectangular plate denotes the fact that the elements
contained within it are replicated N times (N is the
number of languages). In this model, there are four
"root" nodes: the implication value m; the two
feature prior probabilities $\pi_1$ and $\pi_2$; and the
language-specific error rate $\epsilon$. On all of these nodes we place
Bayesian priors. Since m is a binary random
variable, we place a Bernoulli prior on it. The $\pi$s are
parameters of Bernoulli distributions, so they are given
independent Beta priors. Finally, the noise rate $\epsilon$ is also
given a Beta prior. For the two Beta parameters
governing the error rate (i.e., $a_\epsilon$ and $b_\epsilon$) we set these by
hand so that the mean expected error rate is 5% and
the probability of the error rate being between 0%
and 10% is 50% (this number is based on an expert
opinion of the noise-rate in the data). For the rest of
the parameters we use uniform priors.

[1] The exclusive-or of $a$ and $b$, written $a \otimes b$, is true exactly
when either $a$ or $b$ is true but not both.

3.2 The Hier Model 

A significant difficulty in working with any large ty- 
pological database is that the languages will be sam- 
pled nonuniformly. In our case, this means that im- 
plications that seem true in the Flat model may 
only be true for, say, Indo-European, and the remain- 
ing languages are considered noise. While this may 
be interesting in its own right, we are more interested 
in discovering implications that are truly universal. 

We model this using a hierarchical Bayesian 
model. In essence, we take the Flat model and 
build a notion of language relatedness into it. In 
particular, we enforce a hierarchy on the m impli- 
cation variables. For simplicity, suppose that our 
"hierarchy" of languages is nearly flat. Of the N 
languages, half of them are Indo-European and the 
other half are Austronesian. We will use a nearly 
identical model to the Flat model, but instead of 
having a single m variable, we have three: one for 
IE, one for Austronesian and one for "all languages." 

For a general tree, we assign one implication vari- 
able for each node (including the root and leaves). 
The goal of the inference is to infer the value of the 
m variable corresponding to the root of the tree. 

To complete the Hier model, it remains
to specify the probability distribution of the m
random variables. We do this as follows. We
place a zero mean Gaussian prior with (unknown)
variance $\sigma^2$ on the root m. Then, for a non-root
node, we use a Gaussian with mean equal to the
"m" value of the parent and tied variance $\sigma^2$. In
our three-node example, this means that the root is
distributed $\mathrm{Nor}(0, \sigma^2)$ and each child is distributed
$\mathrm{Nor}(m_{\mathrm{root}}, \sigma^2)$, where $m_{\mathrm{root}}$ is the random variable
corresponding to the root. Finally, the leaves
(corresponding to the languages themselves) are
distributed logistic-binomial. Thus, the m random
variable corresponding to a leaf (language) is distributed
$\mathrm{Bin}(s(m_{\mathrm{par}}))$, where $m_{\mathrm{par}}$ is the m value for the
parent (internal) node and $s$ is the sigmoid function
$s(x) = [1 + \exp(-x)]^{-1}$.
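The prior over the m variables can be sketched as a small forward sampler for the two-level case (root, families, languages). This is an illustrative Python sketch under assumptions: the function name, the family dictionary, and the fixed value of sigma are hypothetical, and the paper treats the variance as unknown rather than fixed.

```python
import math
import random
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_hier_prior(families: Dict[str, List[str]], sigma: float,
                      rng: random.Random
                      ) -> Tuple[float, Dict[str, float], Dict[str, int]]:
    """Draw the root, family, and language-level m values from the prior."""
    m_root = rng.gauss(0.0, sigma)                   # root: Nor(0, sigma^2)
    m_family: Dict[str, float] = {}
    m_leaf: Dict[str, int] = {}
    for family, languages in families.items():
        m_family[family] = rng.gauss(m_root, sigma)  # child: Nor(m_root, sigma^2)
        p_holds = sigmoid(m_family[family])
        for language in languages:
            # leaf: binary draw with success probability s(m_parent)
            m_leaf[language] = 1 if rng.random() < p_holds else 0
    return m_root, m_family, m_leaf


rng = random.Random(0)
families = {"Indo-European": ["German", "Hindi"], "Austronesian": ["Malagasy"]}
m_root, m_family, m_leaf = sample_hier_prior(families, sigma=1.0, rng=rng)
```

Note that `random.gauss` takes a standard deviation, so `sigma` here is the square root of the variance used in the text.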

The intuition behind this model is that the m value 
at each node in the tree (where a node is either "all 
languages" or a specific language family or an in- 
dividual language) specifies the extent to which the 



implication under consideration holds for that node. 
A large positive m means that the implication is very 
likely to hold. A large negative value means it is 
very likely to not hold. The normal distributions 
across edges in the tree indicate that we expect the 
m values not to change too much across the tree. At 
the leaves (i.e., individual languages), the logistic- 
binomial simply transforms the real-valued ms into 
the range [0, 1] so as to make an appropriate input to 
the binomial distribution. 

4 Statistical Inference 

In this section, we describe how we use Markov 
chain Monte Carlo methods to perform inference 
in the statistical models described in the previous 
section; Andrieu et al. (2003) provide an excellent 
introduction to MCMC techniques. The key idea
behind MCMC techniques is to approximate
intractable expectations by drawing random samples
from the probability distribution of interest. The
expectation can then be approximated by an empirical
expectation over these samples.

For the Flat model, we use a combination of
Gibbs sampling with rejection sampling as a
subroutine. Essentially, all sampling steps are standard
Gibbs steps, except for sampling the error rates $\epsilon$.
The Gibbs step is not available analytically for these.
Hence, we use rejection sampling (drawing from the
Beta prior and accepting according to the posterior).
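A generic version of this rejection step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' sampler: proposals are drawn from the prior and accepted with probability proportional to a likelihood, and the Beta(1, 19) prior, the error counts, and all function names are hypothetical choices made for the example.

```python
import random
from typing import Callable, Optional


def rejection_sample(propose: Callable[[], float],
                     likelihood: Callable[[float], float],
                     likelihood_max: float,
                     max_tries: int,
                     rng: random.Random) -> Optional[float]:
    """Propose from the prior; accept with probability likelihood / likelihood_max."""
    for _ in range(max_tries):
        candidate = propose()
        if rng.random() < likelihood(candidate) / likelihood_max:
            return candidate
    return None  # caller may keep the previous value when nothing is accepted


def error_likelihood(eps: float, n_errors: int, n_entries: int) -> float:
    """Likelihood of n_errors flipped entries out of n_entries at error rate eps."""
    return eps ** n_errors * (1.0 - eps) ** (n_entries - n_errors)


rng = random.Random(0)
n_errors, n_entries = 2, 40                       # hypothetical counts
peak = error_likelihood(n_errors / n_entries, n_errors, n_entries)
draw = rejection_sample(
    propose=lambda: rng.betavariate(1.0, 19.0),   # hypothetical prior, mean 0.05
    likelihood=lambda eps: error_likelihood(eps, n_errors, n_entries),
    likelihood_max=peak,
    max_tries=20,
    rng=rng,
)
```

Capping the number of attempts mirrors the fixed budget of rejection iterations mentioned in the text; what happens when no proposal is accepted is left to the caller here.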

The sampling procedure for the Hier model is
only slightly more complicated. Instead of
performing a simple Gibbs sample for m in Step (4), we
first sample the m values for the internal nodes
using simple Gibbs updates. For the leaf nodes, we
use rejection sampling. For this rejection, we draw
proposal values from the Gaussian specified by the
parent m, and compute acceptance probabilities.

In all cases, we run the outer Gibbs sampler for 
1000 iterations and each rejection sampler for 20 it- 
erations. We compute the marginal values for the m 
implication variables by averaging the sampled val- 
ues after dropping 200 "burn-in" iterations. 
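The final averaging step is simple enough to state directly. The Python sketch below assumes the sampler has already produced one m value per outer iteration; the function name and the toy sample sequence are hypothetical.

```python
from typing import Sequence


def marginal_after_burn_in(samples: Sequence[float], burn_in: int = 200) -> float:
    """Average the sampled values that remain after discarding the burn-in prefix."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return sum(kept) / len(kept)


# Hypothetical run of 1000 iterations of a binary implication variable.
samples = [0] * 200 + [1] * 600 + [0] * 200
estimate = marginal_after_burn_in(samples, burn_in=200)
```

Here the first 200 draws are ignored, and the estimate is the fraction of the remaining 800 draws in which the implication variable was 1.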

5 Data Preprocessing and Search 

After extracting the raw data from the WALS
electronic database (Haspelmath et al., 2005)[2], we
perform a minor amount of preprocessing.
Essentially, we have manually removed certain feature
values from the database because they are
underrepresented. For instance, the "Glottalized Consonants"
feature has eight possible values (one for "none"
and seven for different varieties of glottalized
consonants). We reduce this to simply two values "has" or
"has not." 313 languages have no glottalized
consonants and 139 have some variety of glottalized
consonant. We have done something similar with
approximately twenty of the features.

[2] This is nontrivial; we are currently exploring the
possibility of freely sharing these data.

For the Hier model, we obtain the hierarchy in 
one of two ways. The first hierarchy we use is the 
"linguistic hierarchy" specified as part of the WALS 
data. This hierarchy divides languages into families 
and subfamilies. This leads to a tree with the leaves 
at depth four. The root has 38 immediate children 
(corresponding to the major families), and there are 
a total of 314 internal nodes. The second hierar- 
chy we use is an areal hierarchy obtained by clus- 
tering languages according to their latitude and lon- 
gitude. For the clustering we first cluster all the lan- 
guages into 6 "macro-clusters." We then cluster each 
macro-cluster individually into 25 "micro-clusters." 
These micro-clusters then have the languages at their 
leaves. This yields a tree with 31 internal nodes. 

Given the database (which contains approximately
140 features), performing a raw search even
over all possible pairs of features would lead to over
19,000 computations. In order to reduce this space
to a more manageable number, we filter:

• There must be at least 250 languages for which both
features are known.

• There must be at least 15 languages for which both
feature values hold simultaneously.

• Whenever $f_1$ is true, at least half of the languages also
have $f_2$ true.
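The three filters above can be sketched as a single predicate over a pair of binary feature columns. This Python illustration is not the authors' code; the function name and keyword parameters are hypothetical, and it assumes that the third criterion is evaluated only over languages for which both features are known.

```python
from typing import List, Optional

Binary = Optional[bool]   # None marks an unknown entry


def passes_filters(f1: List[Binary], f2: List[Binary],
                   min_known: int = 250,
                   min_both_true: int = 15,
                   min_rate: float = 0.5) -> bool:
    """Apply the three pair filters to two aligned feature columns."""
    known = [(a, b) for a, b in zip(f1, f2) if a is not None and b is not None]
    both_true = sum(1 for a, b in known if a and b)
    f1_true = sum(1 for a, _ in known if a)
    if len(known) < min_known:          # filter 1: enough jointly known languages
        return False
    if both_true < min_both_true:       # filter 2: enough joint positives
        return False
    # filter 3: f2 holds in at least half of the languages where f1 holds
    return f1_true > 0 and both_true / f1_true >= min_rate


# Hypothetical columns for 300 languages.
f1 = [True] * 200 + [False] * 100
f2_often = [True] * 150 + [False] * 150
f2_rarely = [True] * 50 + [False] * 250
f1_sparse = [None] * 100 + [True] * 100 + [False] * 100
```

With roughly 140 binary features there are 140 × 139 = 19,460 ordered pairs, which matches the "over 19,000" figure quoted before the list.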

Performing all these filtration steps reduces the
number of pairs under consideration to 3442. While
this remains a computationally expensive procedure,
we were able to perform all the implication
computations for these 3442 possible pairs in about a week
on a single modern machine (in Matlab).

6 Results 

The task of discovering universal implications is, at
its heart, a data-mining task. As such, it is difficult
to evaluate, since we often do not know the correct
answers! If our model only found well-documented
implications, this would be interesting but useless
from the perspective of helping linguists focus their
energies on new, plausible implications. In this
section, we present the results of our method, together
with both a quantitative and qualitative evaluation.

6.1 Quantitative Evaluation 

In this section, we perform a quantitative evaluation 
of the results based on predictive power. That is, 
one generally would prefer a system that finds im- 
plications that hold with high probability across the 
data. The word "generally" is important: this qual- 
ity is neither necessary nor sufficient for the model 
to be good. For instance, finding 1000 implications
of the form $A_1 \supset X, A_2 \supset X, \ldots, A_{1000} \supset X$ is
completely uninteresting if $X$ is true in 99% of the
cases. Similarly, suppose that a model can find 1000
implications of the form $X \supset A_1, \ldots, X \supset A_{1000}$,
but $X$ is only true in five languages. In both of these
cases, according to a "predictive power" measure,
these would be ideal systems. But they are both
somewhat uninteresting.

Despite these difficulties with a predictive
power-based evaluation, we feel that it is a good way to
understand the relative merits of our different models.
Thus, we compare the following systems: Flat (our
proposed flat model), LingHier (our model using
the phylogenetic hierarchy), DistHier (our model
using the areal hierarchy) and Random (a model
that randomly ranks the implications that meet the
three qualifications from the previous section).

The models are scored as follows. We take the 
entire WALS data set and "hide" a random 10% 
of the entries. We then perform full inference and 
ask the inferred model to predict the missing val- 
ues. The accuracy of the model is the accuracy of 
its predictions. To obtain a sense of the quality of
the ranking, we perform this computation on the
top $k$ ranked implications provided by each model;
$k \in \{2, 4, 8, \ldots, 512, 1024\}$.
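The hold-out protocol can be sketched as follows. This Python illustration covers only the data split and the grid of k values, not the inference or prediction steps; the function name is hypothetical, and it assumes that the hidden 10% is drawn from the known (language, feature) cells.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]   # (language index, feature index)


def hide_entries(known_cells: List[Cell], fraction: float,
                 rng: random.Random) -> Tuple[List[Cell], Set[Cell]]:
    """Split the known cells into a visible part and a hidden test part."""
    n_hidden = round(fraction * len(known_cells))
    hidden = set(rng.sample(known_cells, n_hidden))
    visible = [cell for cell in known_cells if cell not in hidden]
    return visible, hidden


# Rank cutoffs: powers of two from 2 to 1024.
ks = [2 ** i for i in range(1, 11)]

# Hypothetical toy database: 10 languages by 5 features, all known.
cells = [(lang, feat) for lang in range(10) for feat in range(5)]
visible, hidden = hide_entries(cells, 0.1, random.Random(0))
```

Repeating the split with different random seeds gives the independent folds over which a mean and standard deviation of the accuracy can be reported.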

The results of this quantitative evaluation are
shown in Figure 3 (on a log-scale for the x-axis).
The two best-performing models are the two
hierarchical models. The flat model does significantly
worse and the random model does terribly. The
vertical lines are a standard deviation over 100 folds of
the experiment (hiding a different 10% each time).
The difference between the two hierarchical
models is typically not statistically significant. At the
top of the ranking, the model based on phylogenetic
information performs marginally better; at the
bottom of the ranking, the order flips. Comparing the
hierarchical models to the flat model, we see that
adequately modeling the a priori similarity between
languages is quite important.

Figure 3: Results of quantitative (predictive) evaluation.
Top curves are the hierarchical models; middle
is the flat model; bottom is the random baseline.

6.2 Cross-model Comparison 

The results in the previous section support the con- 
clusion that the two hierarchical models are doing 
something significantly different (and better) than 
the flat model. This clearly must be the case. The
results, however, do not say whether the two
hierarchies are substantially different. Moreover, are the
results that they produce substantially different? The
answer to these two questions is "yes."

We first address the issue of tree similarity. We 
consider all pairs of languages which are at distance 0
in the areal tree (i.e., have the same parent). We 
then look at the mean tree-distance between those 
languages in the phylogenetic tree. We do this for all 
distances in the areal tree (because of its construc- 
tion, there are only three: 0, 2 and 4). The mean 
distances in the phylogenetic tree corresponding to 
these three distances in the areal tree are: 2.9, 3.5 
and 4.0, respectively. This means that languages that 
are "nearby" in the areal tree are quite often very far 
apart in the phylogenetic tree. 
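The comparison of the two trees can be sketched in Python as below. The sketch assumes one reading of the distance convention in the text: two languages with the same parent are at distance 0, so the distance is the number of edges on the path between the two leaves minus two. The parent-pointer representation, the function names, and the toy trees are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """The node followed by its chain of ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def leaf_distance(a: str, b: str, parent: Dict[str, str]) -> int:
    """Edges between two leaves minus two, so siblings are at distance 0."""
    depth_from_a = {n: i for i, n in enumerate(ancestors(a, parent))}
    for j, n in enumerate(ancestors(b, parent)):
        if n in depth_from_a:
            return depth_from_a[n] + j - 2
    raise ValueError("nodes are not in the same tree")


def mean_distance_by_group(languages: List[str], areal: Dict[str, str],
                           phylo: Dict[str, str]) -> Dict[int, float]:
    """Mean phylogenetic distance for each areal distance."""
    buckets = defaultdict(list)
    for a, b in combinations(languages, 2):
        buckets[leaf_distance(a, b, areal)].append(leaf_distance(a, b, phylo))
    return {d: sum(v) / len(v) for d, v in buckets.items()}


# Hypothetical toy trees given as child -> parent maps.
areal = {"en": "c1", "de": "c1", "fi": "c2", "c1": "M1", "c2": "M1", "M1": "root"}
phylo = {"en": "germanic", "de": "germanic", "fi": "finnic",
         "germanic": "IE", "finnic": "Uralic", "IE": "root", "Uralic": "root"}
result = mean_distance_by_group(["en", "de", "fi"], areal, phylo)
```

In this toy example the two areal siblings are also phylogenetic siblings, whereas pairs at areal distance 2 are far apart phylogenetically.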

To address the issue of whether the results
obtained by the two trees are similar, we employ
Kendall's $\tau$ statistic. Given two ordered lists, the
$\tau$ statistic computes how correlated they are. $\tau$ is
always between 0 and 1, with 1 indicating identical
ordering and 0 indicating completely reversed
ordering. The results are as follows. Comparing Flat
to LingHier yields $\tau = 0.4144$, a very low
correlation. Between Flat and DistHier, $\tau = 0.5213$,
also very low. These two are as expected.
Finally, between LingHier and DistHier, we
obtain $\tau = 0.5369$, a very low correlation, considering
that both perform well predictively.
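Since the standard Kendall coefficient ranges from −1 to 1, a statistic that runs from 0 (reversed) to 1 (identical) is presumably a rescaled variant. One such variant, assumed here purely for illustration and not confirmed by the text, is the fraction of item pairs that appear in the same relative order in both rankings. A minimal Python sketch, with a hypothetical function name:

```python
from itertools import combinations
from typing import List, Sequence


def concordance(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Fraction of item pairs ordered the same way in both rankings."""
    position_b = {item: i for i, item in enumerate(rank_b)}
    # combinations() yields (x, y) with x ranked before y in rank_a
    pairs = list(combinations(rank_a, 2))
    agree = sum(1 for x, y in pairs if position_b[x] < position_b[y])
    return agree / len(pairs)


ranking: List[str] = ["a", "b", "c"]
same = concordance(ranking, ["a", "b", "c"])
reverse = concordance(ranking, ["c", "b", "a"])
one_swap = concordance(ranking, ["a", "c", "b"])
```

Under this convention two unrelated rankings would score about 0.5, which is a useful reference point when reading the values reported above. The sketch assumes both lists contain the same items without ties.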

6.3 Qualitative Analysis 

For the purpose of a qualitative analysis, we
reproduce the top 30 implications discovered by the
LingHier model in Table 2 (see the final page).[3]
Each implication is numbered, then the actual
implication is presented. For instance, #7 says that any
language that has adjectives preceding their
governing nouns also has numerals preceding their nouns.
We additionally provide an "analysis" of many
of these discovered implications. Many of them
(e.g., #7) are well known in the typological
literature. These are simply numbered according to
well-known references. For instance our #7 is implication
#18 from Greenberg, reproduced by Song (2001).
Those that reference Hawkins (e.g., #11) are based
on implications described by Hawkins (1983); those
that reference Lehmann are references to the
principles decided by Lehmann (1981) in Ch 4 & 8.

Some of the implications our model discovers
are obtained by composition of well-known
implications. For instance, our #3 (namely, OV ⊃
Genitive-Noun) can be obtained by combining Greenberg #4
(OV ⊃ Postpositions) and Greenberg #2a
(Postpositions ⊃ Genitive-Noun). It is quite encouraging
that 14 of our top 21 discovered implications are
well-known in the literature (and this, not even
considering the tautologically true implications)! This
strongly suggests that our model is doing something
reasonable and that there is true structure in the data.

In addition to many of the known implications
found by our model, there are many that are
"unknown." Space precludes attempting explanations
of them all, so we focus on a few. Some are easy.
Consider #8 (Strongly suffixing ⊃ Tense-aspect
suffixes). This is quite plausible: if you have a
language that tends to have suffixes, it will probably
have suffixes for tense/aspect. Similarly, #10 states
that languages with verb morphology for questions
lack question particles; again, this can be easily
explained by an appeal to economy.

[3] In truth, our model discovers several tautological
implications that we have removed by hand before presentation. These
are examples like "SVO ⊃ VO" or "No unusual consonants ⊃
no glottalized consonants." It is, of course, good that our model
discovers these, since they are obviously true. However, to save
space, we have withheld them from presentation here. The 30th
implication presented here is actually the 83rd in our full list.

Some of the discovered implications require a
more involved explanation. One such example is
#20: labial-velars implies no uvulars.[4] It turns out
that labial-velars are most common in Africa just
north of the equator, which is also a place that has
very few uvulars (there are a handful of other
examples, mostly in Papua New Guinea). While this
implication has not been previously investigated, it
makes some sense: if a language has one form of
rare consonant, it is unlikely to have another.

As another example, consider #28: Obligatory
subject pronouns implies no possessive affixes. This
means that in languages (like English) for which
pro-drop is impossible, possession is not marked
morphologically on the head noun (as in English, where
"book" appears the same regardless of whether it is "his
book" or "the book"). This also makes sense: if you
cannot drop pronouns, then one usually will mark
possession on the pronoun, not the head noun. Thus,
you do not need marking on the head noun.

Finally, consider #25: High and mid front vowels 
(i.e., /u/, etc.) implies large vowel inventory (> 7 
vowels). This is supported by typological evidence 
that high and mid front vowels are the "last" vowels 
to be added to a language's repertoire. Thus, in order 
to get them, you must also have many other types of 
vowels already, leading to a large vowel inventory. 

Not all examples admit a simple explanation, and
these are worthy of further thought. Some (like
the ones predicated on "SV") may just be
peculiarities of the annotation style: the subject-verb order
changes frequently between transitive and
intransitive usages in many languages, and the annotation
reflects just one. Some others are bizarre: why not
having fricatives should mean that you don't have
tones (#27) is not a priori clear.



[4] Labial-velars and uvulars are rare consonants (order 100
languages). Labial-velars are joined sounds like /kp/ and /gb/
(to English speakers, sounding like chicken noises); uvular
sounds are made in the back of the throat, like snoring.

Postpositions, Adjective-Noun                    ⊃  Demonstrative-Noun
Possessive prefixes, Tense-aspect suffixes       ⊃  Genitive-Noun
Case suffixes, Plural suffix                     ⊃  Genitive-Noun
Adjective-Noun, Genitive-Noun                    ⊃  OV
High cons/vowel ratio, No front-rounded vowels   ⊃  [illegible] tones
Negative affix, Genitive-Noun                    ⊃  OV
No front-rounded vowels, Labial velars           ⊃  Large vowel quality inventory
Subordinating suffix, Tense-aspect suffixes      ⊃  Postpositions
No case affixes, Prepositions                    ⊃  Initial subordinator word
[illegible], Plural suffix                       ⊃  Genitive-Noun

Table 3: Top implications discovered by the
LingHier multi-conditional model.

6.4 Multi-conditional Implications 

Many implications in the literature have multiple 
implicants. For instance, much research has gone 
into looking at which implications hold, considering 
only "VO" languages, or considering only languages 
with prepositions. It is straightforward to modify 
our model so that it searches over triples of features, 
conditioning on two and predicting the third. Space
precludes an in-depth discussion of these results, but
we present the top examples in Table 3 (after
removing the tautologically true examples, which are more
numerous in this case, as well as examples that are
directly obtainable from Table 2). It is encouraging
that in the top 1000 multi-conditional implications
found, the most frequently used were "OV" (176
times), "Postpositions" (157 times) and
"Adjective-Noun" (89 times). This result agrees with intuition.

7 Discussion 

We have presented a Bayesian model for
discovering universal linguistic implications from a
typological database. Our model is able to account for noise
in a linguistically plausible manner. Our
hierarchical models deal with the sampling issue in a unique
way, by using prior knowledge about language
families to "group" related languages. Quantitatively,
the hierarchical information turns out to be quite
useful, regardless of whether it is phylogenetically-
or areally-based. Qualitatively, our model can
recover many well-known implications as well as
many more potential implications that can be the
object of future linguistic study. We believe that
our model is sufficiently general that it could be
applied to many different typological databases; we
attempted not to "overfit" it to WALS. Our hope
is that the automatic discovery of such
implications will not only aid typologically-inclined linguists,
[...] the amount of data field linguists need to collect.
They have also been used computationally to aid
[...]gers (Schone and Jurafsky, 2001). Many extensions
are possible to this model; for instance attempting to
uncover typological hierarchies and other
higher-order structures. We have made the full output of all
models available at http://hal3.name/WALS

#   Implicant ⊃ Implicand                                           Analysis
1   Postpositions ⊃ Genitive-Noun                                   Greenberg #2a
2   OV ⊃ Postpositions                                              Greenberg #4
3   OV ⊃ Genitive-Noun                                              Greenberg #4 + Greenberg #2a
4   Genitive-Noun ⊃ Postpositions                                   Greenberg #2a (converse)
5   Postpositions ⊃ OV                                              Greenberg #2b (converse)
6   SV ⊃ Genitive-Noun                                              ???
7   Adjective-Noun ⊃ Numeral-Noun                                   Greenberg #18
8   Strongly suffixing ⊃ Tense-aspect suffixes                      Clear explanation
9   VO ⊃ Noun-Relative Clause                                       Lehmann
10  Interrogative verb morph ⊃ No question particle                 Appeal to economy
11  Numeral-Noun ⊃ Demonstrative-Noun                               Hawkins XVI (for postpositional languages)
12  Prepositions ⊃ VO                                               Greenberg #3 (converse)
13  Adjective-Noun ⊃ Demonstrative-Noun                             Greenberg #18
14  Noun-Adjective ⊃ Postpositions                                  Lehmann
15  SV ⊃ Postpositions                                              ???
16  VO ⊃ Prepositions                                               Greenberg #3
17  Initial subordinator word ⊃ Prepositions                        Operator-operand principle (Lehmann)
18  Strong prefixing ⊃ Prepositions                                 Greenberg #27b
19  Little affixation ⊃ Noun-Adjective                              ???
20  Labial-velars ⊃ No uvular consonants                            See text
21  Negative word ⊃ No pronominal possessive affixes                See text
22  Strong prefixing ⊃ VO                                           Lehmann
23  Subordinating suffix ⊃ Strongly suffixing                       ???
24  Final subordinator word ⊃ Postpositions                         Operator-operand principle (Lehmann)
25  High and mid front vowels ⊃ Large vowel inventories             See text
26  Plural prefix ⊃ Noun-Genitive                                   ???
27  No fricatives ⊃ No tones                                        ???
28  Obligatory subject pronouns ⊃ No pronominal possessive affixes  See text
29  Demonstrative-Noun ⊃ Tense-aspect suffixes                      Operator-operand principle (Lehmann)
30  Prepositions ⊃ Noun-Relative clause                             Lehmann, Hawkins

Table 2: Top 30 implications discovered by the LingHier model.